Online-Academy

Look, Read, Understand, Apply

Menu

Data Mining And Data Warehousing

Data Warehouse Architecture

Data Warehouse Architecture

Data warehouse architecture should support the data warehouse processes. The major data warehouse processes are
  • Extract and Load
  • Clean and Transform
  • Backup and Archive
  • Query Processing
To support these processes data warehouse architecture has defined three software components:
    Load Manager
    Warehouse Manager
    Query Manager
ProcessPerformed By
Extract and LoadLoad Manager
Clean and TransformWarehouse Manager
Backup and ArchiveWarehouse Manager
Query ProcessingQuery Manager

Extract and Load

The Data warehouse is populated with the data extracted from the operational sources. We don't directly do data mining in the operational source data, as those data may be inconsistent, redundant, incomplete. So, the operational data is first loaded to the temporary storage by the Load manager. Load manager performs simple transformations like removing unnecessary columns or data from the operational data.

The data loaded into the temporary storage is further cleaned and transformed by the warehouse manager; Warehouse manager makes inconsistent data consistent, removes redundancies, completes incomplete data, transform data to the format acceptable by the data warehouse and data mining modules. In addition to these tasks, the warehouse manager creates an index on the data for fast retrieval.

To address the users' queries several summaries and aggregate tables are created. With the time more and more summaries and aggregate tables are added to the data warehouse; some of the old aggregates table may be of no use, or users are not asking for data from those tables. So, the warehouse manager archives those summaries and aggregate tables and stores in the backup in such a way, those archives can be easily converted to their original form as in future, somebody may ask data related to those summaries also. The process of backup and archive helps to reduce the memory requirement, as those unusable summaries don't have to be stored in memory for live access.

Users use the data warehouse to get answers to their queries. Different users have a different kind of data/information requirements; to address those different data/information requirements several summaries and aggregate tables have to be produced. It is difficult to know in advance which summaries and aggregates are to be made. So, Query manager keeps track of all the users' queries, ranks than, count their frequency, calculate users' active session time, keeps track of mostly used tables, data. This information collected by the query manager helps query manager to respond users quickly; as it knows to which summaries to be used to answer users, query.